Estimating high-dimensional directed acyclic graphs 
with the PC-algorithm 

Markus Kalisch and Peter Buhlmann* 
February 2, 2008 



Abstract 

We consider the PC-algor ithm (0) for estimating the skeleton 
of a very high-dimensional acyclic directed graph (DAG) with corre- 
sponding Gaussian distribution. The PC-algorithm is computationally 
feasible for sparse problems with many nodes, i.e. variables, and it has 
the attractive property to automatically achieve high computational 
efficiency as a function of sparseness of the true underlying DAG. We 
prove consistency of the algorithm for very high-dimensional, sparse 
DAGs where the number of nodes is allowed to quickly grow with 
sample size n, as fast as 0(n a ) for any < a < oo. The sparseness 
assumption is rather minimal requiring only that the neighborhoods 
in the DAG are of lower order than sample size n. We empirically 
demonstrate the PC-algorithm for simulated data and argue that the 
algorithm is rather insensitive to the choice of its single tuning param- 
eter. 

1 Introduction 

Graphical models are a popular probabilistic tool to analyze and visualize 
conditional independence relationships between random variables (see Q], 
[lfll|). Major building blocks of the models are nodes, which represent ran- 
dom variables and edges, which encode conditional dependence relations of 
*Both authors are affiliated with the Seminar fur Statistik, ETH Zurich, Switzerland. 
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the enclosing vertices. The structure of conditional independence among the 
random variables can be explored using the Markov properties. 

Of particular current interest are directed acyclic graphs (DAGs), con- 
taining directed rather than undirected edges, which restrict in a sense the 
conditional dependence relations. These graphs can be interpreted by apply- 
ing the directed Markov property. When ignoring the directions of a DAG, 
we get the skeleton of a DAG. In general, it is different from the conditional 
independence graph (CIG), see section r2.ll Thus, estimation methods for 
directed graphs cannot be easily borrowed from approaches for undirected 
CIGs. 

Estimation of a DAG from data is difficult and computationally non- 
trivial due to the enormous size of the space of DAGs: the number of possible 
DAGs is super-exponential in the number of nodes. Nevertheless, there are 
quite successful search-and-score methods for problems where the number of 
nodes is small or moderate. For example, the search space may be restricted 
to trees as in MWST (Maximum Weight Spanning Trees; see [s| and 0|), 
or a greedy search is employed. The greedy DAG search can be improved 
by exploiting probabilistic equivalence relations, and the search space can 
be reduced from individual DAGs to equivalence classes, as proposed in 
GES (Greedy Equivalent Search, see 0). Although this method seems quite 
promising when having few or a moderate number of nodes only, it is limited 
by the fact that the space of equivalence classes is conjectured to grow super- 
exponentially in the nodes as well (see Q). Bayesian approaches for DAGs, 
which are computationally very intensive, include 12] and 0]. 

An interesting alternative to greedy or structurally restricted approaches 
is the PC-algorithm from 13]. It starts from a complete, undirected graph 
and deletes recursively edges based on conditional independence decisions. 
This yields an undirected graph which can then be partially directed and 
further extended to DAGs. For the skeleton of a DAG, i.e. the undirected 
version of a DAG, the PC-algorithm runs in the worst case in exponential 
time (as a function of the number of nodes), but if the true underlying 
DAG is sparse, which is often a reasonable assumption, this reduces to a 
polynomial runtime. 

We focus in this paper on estimating DAGs in the high-dimensional 
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context when having many nodes, i.e. the number of nodes p may be much 
larger than sample size n. We prove that the PC-algorithm consistently 
estimates the skeleton of an underlying sparse DAG, as sample size n — ► oo, 
even if p = p n = 0(n a ) (0 < a < oo) is allowed to grow very quickly as a 
function of n. Our implementation of the PC-algorithm allows to estimate 
the skeleton of a sparse DAG even if p is in the hundreds or thousands. 
For the high-dimensional setting with p > n, sparsity of the underlying 
DAG is crucial for statistical consistency and computational feasibility. The 
PC-algorithm seems to be the only method for high-dimensional settings 
which is computationally feasible and, due to the new results in this paper, 
provably correct in an asymptotic sense. 

We argue empirically that the PC-algorithm is rather insensitive to the 
choice of its single tuning parameter, a significance level for testing, and we 
compare the PC-algorithm with other methods, at least for low- or mid- 
dimensional problems. 

2 The skeleton of a DAG 
2.1 Definitions and preliminaries 

A graph G = (V, E) consists of a set of nodes or vertices V = {1, . . . ,p} 
and a set of edges E C V x V, i.e. the edge set is a subset of ordered 
pairs of distinct nodes. In our setting, the set of nodes corresponds to the 
components of a random vector X £ W. An edge £ E is called 

directed if £ E but ^ E: we then use the notation i — > j. An 
acyclic directed graph (DAG) is a graph G where all edges are directed and 
not containing any cycle. 

If there is a directed edge i — > j, node % is said to be a parent of node 
j. The set of parents of node j is denoted by pa(j). The set of neighbors 
of a node j, denoted by ne(j), are all nodes i with a directed edge i — > j or 
j — > i. Equivalently, ne(j) is often referred to as the adjacency set adj(G,j) 
of a node j in the graph G. The skeleton of a DAG G is the undirected 
graph obtained from G by substituting undirected edges for directed edges. 

A probability distribution P on W is said to be faithful with respect to 
a graph G if conditional independencies of the distribution can be inferred 



3 



from d-separation in the graph G and vice-versa. More precisely: consider 
a random vector X ~ P. Faithfulness of P with respect to G means: for 
every set s C V, 

X w and X (i) are conditionally independent given 
4=> node i and node j are d-separated by the set s. 

The notion of d-separation can be defined via moral graphs; details are 
described in Prop. 3.25]. We remark here that faithfulness is ruling 
out some classes of probability distributions. An example of a non-faithful 



distribution is given in [13 . Chapter 3.5.2]. On the other hand, non- faithful 
distributions form a Lebesgue null-set in the space of distributions associated 
with a DAG G, see Q, Th. 3.2]. 

It is well known that for a probability distribution P which is generated 
from a DAG G, there is a whole equivalence class of DAGs with corre- 
sponding distribution P (see 0, Section 2.2 ]), and we can only identify an 
equivalence class of DAGs, even when having infinitely many observations. 
But the skeletons of DAGs from the same equivalence class are the same, 
and thus, inferring a skeleton from data is an easier and better identifiable 
task than aiming for directed graphs. We point out that in general, the 
skeleton of a DAG G with corresponding distribution P is different from 
the conditional independence graph corresponding to the distribution P. In 
particular, if P is faithful with respect to a DAG G, 

there is an edge between nodes i and j in the skeleton of DAG G 
44> for all s C V \ X^ and are conditionally independent 

given {X (r) ; res}, (1) 

fjlil. Th. 3.4]). This implies the following: if P is faithful with respect to a 
DAG G, the skeleton of the DAG G is a subset (or equal) to the conditional 
independence graph (CIG) corresponding to P. The reason is that an edge 
in a CIG requires only conditional dependence given the set V \ {i, j}. We 
conclude that if the true underlying probability mechanisms are generated 
from a DAG, it is more appropriate to use the undirected skeleton as a target 
than the undirected conditional independence graph. 
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2.2 The PC-algorithm for the skeleton 



A naive strategy would be to check conditional independencies given all sub- 
sets s C V \ {i,j} (see formula (fT]l). i.e. all partial correlations in the case 
of multivariate normal distributions. This would become computationally 
infeasible and statistically ill-posed for p larger than sample size. A much 
better approach is to use the PC-algorithm which is able to exploit sparse- 
ness of the graph. More precisely, we apply the part of the PC-algorithm 
that identifies the undirected edges of the DAG. 

2.2.1 Population Version 

In the population version of the PC-algorithm, we assume that perfect 
knowledge about all necessary conditional independence relations is avail- 
able. 

The PC pop (m)-algorithm 

1. Form the complete undirected graph C on the vertex set V. 

2. Set^ = -1; C = C 

a) repeat 

Increase i by one. 

b) repeat 

Select an ordered pair of nodes i,j that are adjacent in C such 
that \adj(C,i) \ {j}\ > i and k C adj{C,i) \ {j} with |k| = I. 
If i and j are conditionally independent given k, delete edge i, j. 
Denote this new graph by C. 

b) until all ordered pairs of adjacent variables i and j such that 
\adj(C, i) \{j}\ > t and k C adj(C, i) \{j} with |k| = I have been 
tested for conditional independence 

a) until t = m or 

for each ordered pair of adjacent nodes \adj(C,i) \ {j}\ < £. 
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This is the description of the population PC pop (m)-algorithm which is stopped 
at a pre-specified level m; the index I may not even reach m if the second 
statement for termination of 2a) applies. There is no need to tune the pa- 
rameter m when using the reached stopping level, 

m reach = maxjstopping level m; index i = m}. (2) 

The value of m reac h depends on the underlying distribution. 

Definition 1 (Population version) The PC pop - algorithm / flify ) is defined 
as the PCpop (m reac h)- algorithm. 

A proof that this algorithm produces the correct skeleton can be easily 
deduced from Theorem 5.1 in |13(]. We summarize the result as follows. 

Proposition 1 Consider a DAG G and assume that the distribution P is 
faithful to G. Denote the maximal number of neighbors by q = maxi<j< p |ne(j)| . 
Then, the PC pop - algorithm constructs the true skeleton of the DAG. More- 
over, for the reached stopping level: m reac h £ {q — l,q}. 

A proof is given in section H3 
2.2.2 Sample version for the skeleton 

For finite samples, we need to estimate conditional independencies. We 
limit ourselves to the Gaussian case, where all nodes correspond to random 
variables with a multivariate normal distribution. Furthermore, we assume 
faithful models, i.e. the conditional independence relations can be read of 
the graph and vice versa; see section 12.11 

In the Gaussian case, conditional independencies can be inferred from 
partial correlations. 

Proposition 2 Assume that the distribution P of the random vector X is 
multivariate normal. For i ^ j G {1, . . . ,p}, k C {1, . . . ,p] \ {i,j}, denote 
by /Ojj|k the partial correlation between XW and X^) given {XW; rek}. 
Then, Pij\k = if and only if X^) and X^) are conditionally independent 
given {X( r ); r G k}. 
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Proof: The claim is an elementary property of the multivariate normal dis- 
tribution, cf. 10, Prop. 5.2.]. □ 

We can thus estimate partial correlations to obtain estimates of condi- 
tional independencies. The sample partial correlation ftjik can be calculated 
via regression or recursively by using the following identity: for some /j6k, 

Pi,j\k\h ~ Pi,h\k\hPj,h\k\ h 



Pi,j\k 



1 - Plh^hX 1 ~ ft 



For testing whether a partial correlation is zero or not, we apply Fisher's 
z-transform 

zMk)4i°g(^*y (3) 

2 \ l -Pi,j\kj 

Classical decision theory yields then the following rule when using the sig- 
nificance level a. Reject the null-hypothesis Ho(i,j\k) : ftjik = against 
the two-sided alternative HA(i,j\k) : Pij\k / if \Jn — |k| — 3|Z(z, j'|k)| > 
- a/2), where $(•) denotes the cdf of Af(0, 1). 
The sample version of the PC-algorithm is almost identical to the pop- 
ulation version in section ?2.2. 11 except from step 2b). 

The PC(m)-algorithm 

Run the PC pop (m)-algorithm as described in section 12.2, II but replace 
in 2b) the statement about conditional independence of i, j given k by 
y/n- |k| -3\Z(i,j\k)\ < fc-^l - a/2), see ©. 

The algorithm yields a data-dependent value Th r eachn which is the maximal 
stopping level that is reached, i.e. the sample version of |2J). 

Definition 2 (Sample version) The PC-algorithm is defined as the PC(m reac h, n )- 
algorithm. 

As we will see in Theoreml^l the stopping level f^ireach^n 

provides a reasonable 

value for the stopping level m. The only tuning parameter of the PC- 
algorithm is a, i.e. the significance level for testing partial correlations. The 
algorithm seems to be rather insensitive to the choice of a, see section 

As we will see below in section the algorithm is asymptotically con- 
sistent even if p is much larger than n but the DAG is sparse. 
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3 Consistency for high-dimensional skeletons 



We will show that the PC-algorithm from section 12.2.21 is asymptotically 
consistent for the skeleton of a DAG, even if p is much larger than n but 
the DAG is sparse. We assume that the data are realizations of i.i.d. ran- 
dom vectors Xi, . . . , X n with Xj G W from a DAG G with corresponding 
distribution P. To capture high-dimensional behavior, we will allow to let 
the dimension grow as a function of sample size: thus, p = p n and also the 
DAG G = G n and the distribution P = P n . Our assumptions are as follows. 

(Al) The distribution P n is multivariate Gaussian and faithful to the DAG 
G n for all n. 

(A2) The dimension p n = 0(n a ) for some < a < oo. 

(A3) The maximal number of neighbors in the DAG G n is denoted by 
q n = maxi<j< Pn |ne(j)|, with q n = 0(n 1 ^ b ) for some < b < 1. 

(A4) The partial correlations between X^ and X^) given {XW;r G k} for 
some set k C {1, . . . ,p n }\{i, j} are denoted by p n; jj|k- Their absolute 
values are bounded from below and above: 

m f{k,j|kl; with pjjik / 0} > c n , c" 1 = 0{n d ), 

for some < d < 6/2, 

sup \Pi,j\k\ <M <1, 

n;i,jr',k 

where < b < 1 is as in (A3). 

Assumption (Al) is an often used assumption in graphical modeling, al- 
though it does restrict the class of possible probability distributions (see 
also third paragraph of section |2~T|) : (A2) allows for an arbitrary polynomial 
growth of dimension as a function of sample size, i.e. high-dimensionality; 
(A3) is a sparseness assumption and (A4) is a regularity condition. As- 
sumptions (A3) and (A4) are rather minimal: note that with b = 1 in (A3), 
e.g. fixed q n = q < oo, m n = m < oo, the partial correlations can decay 
as n~ l / 2+£ for any < e < 1/2. Our assumptions are simpler and seem to 
be weaker, although not directly comparable, than in [llj who analyze the 
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Lasso for estimating high-dimensional undirected conditional independence 
graphs (where the growth in dimensionality is as in (A2)). If the dimension 
p is fixed (with fixed DAG G and fixed distribution P), (A2), (A3) and (A4) 
hold and (Al) remains as the only condition. 

Theorem 1 Assume (Al), (A2), (A3) with < b < 1 and (A4) with 
< d < 6/2. Denote by G s kei,n(&n, m n) the estimate from the PC(m n )- 
algorithm in section \2.2.2\ and by G s k e i,n the true skeleton from the DAG 
G n . Moreover, denote by m reac h^ n the value described in f£J). Then, for 
m n > m rea ch,ni m n = 0(n 1 ~ b ) (n — ► oo), there exists a n — > (n — > oo) 
such that 

¥[Gakel,n( a n) m n) = G s k e l,n] 

= 1 - 0(exp(-Cn 1 ~ 2d )) ->■ 1 (n oo) /or some < C < cxd. 

A proof is given in section H3 The lower bound of the range for m n is 
^reach^n is either equal to qr n — 1 or g„, see Proposition^ i.e. it depends on 
the unknown sparseness q n in (A3). A non-constructive choice for the value 
of the significance level is a n = 2(1 — <I>(n 1 / 2 c n /2)) which depends on the 
unknown lower bound of partial correlations in (A4). 

Remark 1. For the case with fixed dimension p (with fixed DAG G and 
fixed distribution P) , Theorem ^ becomes: for any choice of m n > p — 
2, m n = o(n) (n — > oo) and using a n = 2(1 — <I>(D(nlog(n)~ 1 ) 1 / 2 )) for any 
< D < oo, 

= 1 — 0(exp(— Cn log(n) -1 )) — >• 1 (n — > oo) for some < C < oo. 

Remark 2. Denote by u n the minimal stopping level m such that the pop- 
ulation PC-algorithm PC pop (m) yields the true skeleton of the underlying 
DAG G. It is known that u n < maxi<j< Pn |pa(j)| , i.e. the maximal number 
of parents; this can be deduced from Theorem 5.1 in 13|. Moreover, Theo- 
rem^also holds for m n > u n , m n = 0(n 1_b ), and instead of (A3) it would 
suffice to require the weaker condition that u n = 0{n l ~ b ). The latter holds 
if the maximal number of parents satisfies maxi<j< Pn |pa(j)| = 0(n 1 ^ b ). 
The proof is as for Theorem ^ 
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Theorem ^ leaves some flexibility for choosing m n . The PC-algorithm 
yields a data-dependent reached stopping level rh reac h,n, i-e- the sample 
version of ©. 

Theorem 2 Assume (Al)-(A^). Then, 

^[m re ach,n = m reac h,n] = 1 _ 0(exp(-Cn 1_M )) -> 1 (n ->• oo) 
/or some < C < oo, 

where d > is as in (A4)- 

A proof is given in section H3 Because there are faithful distributions 
which require m n = m reac h )n £ {<Zn — 1, g^} for consistent estimation with 
the PC(m)-algorithm, Theorem |2] indicates that the PC-algorithm, stopping 
at rh reac h n , yields with high probability the smallest m = m n which is 
universally consistent for all faithful distributions. Therefore, there is no 
need to select a tuning parameter m = m n : the PC-algorithm yields a good, 
data-dependent rh reac h : n- 

Theorems n an d HI together yield the consistency of the PC-algorithm, 
i.e. the PC(m reac / i n )-algorithm. 

Corrolary 1 Assume (A1)-(A4)- Denote by G s keln( a n) the estimate from 
the PC-algorithm in section T2.2. 6 A and by G s }~ e i n the true skeleton from the 
DAG G n . Then, there exists a n — > (n — > oo) such that 

~P[G s kel,n(oc n ) = G s kel,n] 

= 1 - 0(exp(-Cn 1-M )) — >• 1 (n — >• oo) for some < C < oo, 

where d > is as in (A4)- 

Our theoretical framework allows for rather large values of p. The com- 
putational complexity of the PC-algorithm is difficult to evaluate exactly, 
but the worst case is bounded by 

0(p mreach ' n ) which is with high probability bounded by 0(p qn ) (4) 

as a function of dimensionality p. We note that the bound may be very 
loose for many distributions. Thus, for the worst case where the complexity 
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bound is achieved, the algorithm is computationally feasible if q n is small, 
say q n < 3, even if p is large. For non-worst cases, however, we can still do 
the computations for much larger values of q n and fairly dense graphs, e.g. 
some nodes j have neighborhoods of size up to |ne(j)| = 30. 

In practice, we can check the value of m reac h t n- As long as it is of "lower 
order" than sample size n, the PC-algorithm yields satisfactory results. 

4 Numerical examples 

We analyze the PC-algorithm and other alternative methods for the skele- 
ton using various simulated data. The numerical results have been obtained 
using the R-package pcalg (|9|) and the Bayes Net Toolbox of Kevin Mur- 
phy 

4.1 Simulating data 

In this section, we analyze the PC-algorithm for the skeleton using simulated 
data. 

In order to simulate data, we first construct an adjacency matrix A as 
follows: 

1. Fix an ordering of the variables. 

2. Fill the adjacency matrix A with zeros. 

3. Replace every matrix entry in the lower triangle (below the diagonal) 
by independent realizations of Bernoulli(s) random variables with suc- 
cess probability s where < s < 1. We will call s the sparseness of 
the model. 

4. Replace each entry with a 1 in the adjacency matrix by independent 
realizations of a Uniform([0.1, 1]) random variable. 

This then yields a matrix A whose entries are zero or in the range [0.1, 1]. 
The corresponding DAG draws a directed edge from node i to node j if i < j 
and Aji ^ 0. The DAGs (and skeletons thereof) that are created in this way 
have the following property: E[iVj] = s(p — 1), where Ni is the number of 
neighbors of a node i. 
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Thus, a low sparseness parameter s implies few neighbors and vice- versa. 
The matrix A will be used to generate the data as follows. The value of the 
random variable corresponding to the first node, is given by 

e (1) ~ iV(0,l) 
XW = e« 

and the values of the next random variables (corresponding to the next 
nodes) can be computed recursively as 

e (i) ~ N(0, 1) 
i-l 

X®=Y,MkXW+S (i = 2,...,p), 

k=l 

where all eW, . . . , are independent. 

4.2 Comparison with alternative methods 

In this section, we will compare the PC-algorithm with two alternative meth- 
ods, Greedy Equivalent Search (GES, see 0) and Maximum Weight Span- 
ning Trees (MWST, see 0]) which both try to find DAGs that maximize the 
BIC criterion. 

We found, that the BIC based methods find DAGs with high True Pos- 
itive Rate (TPR) but also rather high False Positive Rate (FPR). If only 
a small amount of observations is available (as is often the case in a very 
high-dimensional setting), we cannot hope to recover the complete underly- 
ing model. Therefore, instead of large TPR, we would rather prefer a subset 
of edges with high reliability. A measure for high reliability is the True 
Discovery Rate (TDR), which is the ratio of correctly found edges and the 
total number of all edges found. 

As can be seen in table HTT1 the PC-algorithm achieves in our simulations 
by far higher True Discovery Rates than GES or MWST: of all found edges, 
91% were correct. Thus, although a smaller total of edges was found, the 
estimated edges were correct more frequently. We think, that this is a 
substantial advantage for real world applications. 
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Method 


ave[TPR] 


ave[FPR] 


ave[TDR] 


PC 
GES 
MWST 


0.57 (0.06) 
0.85 (0.05) 
0.66 (0.07) 


0.02 (0.01) 
0.13 (0.04) 
0.06 (0.01) 


0.91 (0.05) 
0.71 (0.07) 
0.78 (0.06) 



Table 4.1: p = 10 nodes, sample size n = 50, sparseness s = 0.1, 50 repli- 
cates. Standard errors are given in parentheses. The PC-algorithm achieves 
a substantially higher True Discovery Rate than GES or MWST. 

4.3 Different parameter settings 



As introduced in section 12.2.21 the PC-algorithm has only one tuning pa- 
rameter a. In this section, we analyze the dependence of the algorithm on 
this parameter for different settings. 



a 


ave[TPR] 


ave[FPR] 


ave[TDR] 


ave[m rea ch\ 


0.001 
0.01 
0.05 
0.1 
0.3 


0.065 (0.002) 
0.089 (0.003) 
0.116 (0.003) 
0.128 (0.003) 
0.151 (0.005) 


0.0057 (0.0005) 
0.0082 (0.0007) 
0.0133 (0.0009) 
0.0161 (0.0010) 
0.0238 (0.0011) 


0.80 (0.02) 
0.78 (0.02) 
0.75 (0.02) 
0.73 (0.02) 
0.68 (0.02) 


2.56 (0.07) 
2.92 (0.06) 
3.26 (0.06) 
3.46 (0.08) 
4.28 (0.08) 


Table 4.2: p = 30, n = 20, s = 0.1, 50 replicates; s.e. in parentheses 


a 


ave[TPR] 


ave[FPR] 


ave[TDR] 


&ve[rh reach ) 


0.001 
0.01 
0.05 
0.1 
0.3 


0.069 (0.002) 
0.092 (0.002) 
0.116 (0.003) 
0.131 (0.003) 
0.159 (0.004) 


0.0056 (0.0005) 
0.0097 (0.0007) 
0.0141 (0.0008) 
0.0165 (0.0008) 
0.0233 (0.0010) 


0.80 (0.02) 
0.77 (0.02) 
0.73 (0.01) 
0.73 (0.01) 
0.70 (0.01) 


2.30 (0.07) 
2.92 (0.06) 
3.28 (0.07) 
3.50 (0.08) 
4.34 (0.07) 



Table 4.3: p = 30, n = 20, s = 0.4, 50 replicates; s.e. in parentheses. 



Tables El to show the average over 50 replicates of TPR, FPR, TDR 
and rh reac h for the DAG model in section |4~T1 with p = 30 nodes and varying 
sample size n and sparseness s. 

In the wide range of as, no choice can be identified as being the best or 
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a 


<we[TPR] 


we[FPR] 


&ve[TDR] 


ave[m rea ch] 


0.001 


0.153 (0.004) 


0.015 (0.001) 


0.77 (0.01) 


4.02 (0.07) 


0.01 


0.175 (0.005) 


0.017 (0.001) 


0.77 (0.01) 


4.38 (0.09) 


0.05 


0.193 (0.005) 


0.020 (0.001) 


0.76 (0.01) 


4.82 (0.08) 


0.1 


0.200 (0.005) 


0.021 (0.001) 


0.76 (0.01) 


5.00 (0.09) 


0.3 


0.221 (0.006) 


0.025 (0.001) 


0.74 (0.01) 


5.66 (0.09) 



Table 4.4: p = 30, n = 100, s = 0.1, 50 replicates; s.e. in parentheses. 



a 


&ve[TPR] 


ave[FPi?] 


ave[TDR] 


ave[in reach ] 


0.001 


0.155 (0.004) 


0.015 (0.001) 


0.78 (0.01) 


4.12 (0.08) 


0.01 


0.174 (0.004) 


0.016 (0.001) 


0.78 (0.01) 


4.54 (0.08) 


0.05 


0.188 (0.005) 


0.020 (0.001) 


0.76 (0.01) 


4.78 (0.09) 


0.1 


0.196 (0.005) 


0.021 (0.001) 


0.76 (0.01) 


4.92 (0.09) 


0.3 


0.217 (0.006) 


0.028 (0.001) 


0.71 (0.01) 


5.58 (0.10) 



Table 4.5: p = 30, n = 100, s = 0.4, 50 replicates; s.e. in parentheses. 



a 


&ve[TPR] 


&ve[FPR] 


ave[TDR] 


&ve{m reach ] 


0.001 


0.250 (0.007) 


0.033 (0.001) 


0.71 (0.01) 


6.5 (0.1) 


0.01 


0.258 (0.007) 


0.036 (0.001) 


0.70 (0.01) 


6.7 (0.1) 


0.05 


0.264 (0.007) 


0.038 (0.001) 


0.69 (0.01) 


7.0 (0.1) 


0.1 


0.268 (0.007) 


0.041 (0.001) 


0.68 (0.01) 


7.3 (0.1) 


0.3 


0.283 (0.007) 


0.047 (0.001) 


0.67 (0.01) 


7.6 (0.1) 



Table 4.6: p = 30, n = 5000, s = 0.1, 50 replicates; s.e. in parentheses. 



a 


<we[TPR] 


ave[FPi?] 


ave[TDR] 


ave[m reach ] 


0.001 


0.260 (0.007) 


0.031 (0.001) 


0.73 (0.01) 


6.40 (0.09) 


0.01 


0.268 (0.007) 


0.035 (0.001) 


0.72 (0.01) 


6.80 (0.09) 


0.05 


0.277 (0.006) 


0.036 (0.001) 


0.72 (0.01) 


7.04 (0.09) 


0.1 


0.281 (0.007) 


0.038 (0.001) 


0.71 (0.01) 


7.22 (0.10) 


0.3 


0.294 (0.006) 


0.045 (0.001) 


0.68 (0.01) 


7.70 (0.11) 



Table 4.7: p = 30, n = 5000, s = 0.4, 50 replicates; s.e. in parentheses. 
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worst. Especially in the case of very few observations we see that small a 
leads to the discovery of very few edges with high reliability (high TDR), 
whereas higher values of a lead to the discovery of more edges but with 
less reliability. Therefore, a can be used for fine tuning in finding a good 
compromise between amount of edges found and their reliability. 

Note, however, that especially for larger sample sizes, the rates vary 
only little, sometimes only by a few percent. Comparing this with the large 
change in a (over two orders of magnitude), we feel that the PC-algorithm 
is rather insensitive to the choice of its single tuning parameter. 



5 Conclusions 

The PC-algorithm is a powerful method for estimating the skeleton of a 
potentially very high-dimensional DAG with corresponding Gaussian distri- 
bution. Sparsity, in terms of the maximal size of the neighborhoods of the 
true underlying DAG, is crucial for statistical consistency (assumption (A3) 
and Theorem [J) and for computational feasibility with at most a polyno- 
mial complexity (see @) as a function of dimensionality. We prove consis- 
tency for high-dimensional frameworks under rather minimal assumption on 
sparseness and decay of non-zero partial correlations. 

The PC-algorithm compares well with alternative approaches like MWST 
and GES for low- or mid-dimensional problems. For high-dimensional set- 
tings, MWST and GES (with the implementations we used) become ex- 
tremely slow while the PC-algorithm is still computationally feasible; e.g. 
a polynomial algorithm for a sparse DAG, see ©. Software for the PC- 
algorithm will be made available in R, package pcalg (0]). 



6 Proofs 

6.1 Proof of Proposition [T] 

Consider X with distribution P. Since P is faithful to the DAG G, condi- 
tional independence of X" and given {X^ r ^; r £ k} (k C V \_H,j}) 
is equivalent to d-separation of nodes i and j given the set k (see 13j, Th. 



3.3]). Thus, the population PC pop - algorithm as formulated in section 12.2.11 
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coincides with the one from |l3| which is using the concept of d-separation, 
and the first claim about correctness of the skeleton follows from |13 . Th. 



The second claim about the value of m reac h can be proved as follows. 
First, due to the definition of the PC pop (m)-algorithm and the fact that it 
constructs the correct skeleton, m reac h < q. We now argue that m Teac h > 
q — 1. Suppose the contrary. Then, m reac /j < q — 2: we could then continue 
with a further iteration in the algorithm since m reac h + 1 < q — 1 and 
there is at least one node j with neighborhood-size |ne(j)| = q: that is, the 
reached stopping level would be at least q — 1 which is a contradiction to 

ni reach < q - 2. □ 

6.2 Proof of Theorem [T] 

6.2.1 Analysis of partial correlations 

We first establish uniform consistency of estimated partial correlations. De- 
note by pij and pi j the sample and population correlation between Xw 
and X^. Likewise, ftjik an d Pi,j\k denote the sample and population 
partial correlation between Xw and X( J ) given {X^ r ^;r G k}, where k C 
{!,-•• ,Pn} \ 

Many partial correlations (and non-partial correlations) are tested for 
being zero during the run of the PC(m n )-algorithm. For a fixed ordered 
pair of nodes i, j, the conditioning sets are elements of 



5.1., Ch. 13]. 



K™? ={kC{l,... A }\ {i,j} : |k| < m n } 



whose cardinality is bounded by 



< Bp™" for some < B < oo. 



(5) 



Lemma 1 Assume (Al) (without requiring faithfulness) andsup ni7 Lj \pn;i,j\ < 
M < 1 (compare with (A4))- Then, for any < 7 < 2, 



SUp P[\pn;i,j ~ Pn;i,j\ > H < C l( n ~ 2 ) ex P ( n ~ 4 ) lo g( 

for some constant < C% < 00 depending on M only. 
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Proof: We make substantial use of |S]'s work. Denote by f n (f,p) the proba- 
bility density function of the sample correlation p = p n +i-i,j based onn + 1 
observations and by p = p n +i-i t j the population correlation. (It is notation- 
ally easier to work with sample size n + 1; and we just use the abbreviated 
notations with p and p). For < 7 < 2, 

n\p-p\ >7i =p[p<p- 7 ]+p[p>p+7]- 



It can be shown, that f n (r, p) = f n (—r,—p), see |8|, p. 201]. This symmetry 
implies, 

V p [p<p-l] =Vp[p>p + 1 ] with ,5=-/?. (6) 

Thus, it suffices to show that P[/5 > p + 7] = P p [/5 > p + 7] decays exponen- 
tially in n, uniformly for all p. 

It has been shown (|8|, p. 201, formula (29)]), that for —1 < p < 1, 

P[p > p + 7] < p ) )r ^ 1 ) M (p + 7 )(1 + ^ ~T7) (7) 
V2 : 7rr(n+i) l-|p| 



with 



f 1 -3 1 

M (p + 7 )= / (1 -p 2 )f (1 -x 2 )V(l - /9X )-"+2 ( ix 

(1 — p ) 2 (1 — x ) 2 (1 — px) 2 (using n = n — 3) 

P+7 



(1- P 2 )f f 1 ( y/T=?y/T 



X 



2 



< v ry r / ( v r v ) n dx 

{l-\p\)2j P+1 l ~PX 



(l-|p|)f P+7<^<1 1-/KE 



We will show now that <?p(x) = — — < 1 for all p + 7 < 2; < 1 and 

— 1 < p < 1 (in fact, p < 1 — 7 due to the first restriction). Consider 



f v v / r^7yi-( P +7) 2 

sup 5 p (a;) = sup — ■ — r 

-Kp<l;p+7<x<l -l<p<l-7 L-PKP + 1) 



1 ■ ' < 1 for all < 7 <(2) 
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Therefore, for —1 < —M<p<M< 1 (see assumption (A4)) and using 

-I 

have 



Q-© together with the fact that r ^"i) < const, with respect to n, we 



P[p> p + 7 ] 
(n-l)r(n) (l-p 2 )f 4- 7 2 - 2 

- v^r(n + i)(i-| p |)l l 4 + 7 2 j 1 

(n-l)r(n) 1 2( 4-V fi _2_ 
" v / 2Tr(n + i)(l-M)l V 4 + 7 2J 1 1-M J " 

< d(n - l)(^4f = Clin ~ 1) exp((n - 3) M^J)), 
4 + 7 Z 4 + 7^ 



where < Ci < oo depends on M only, but not on p or 7. By invoking (jHJ), 
the proof is complete (note that the proof assumed sample size n + 1). □ 

Lemma H can be easily extended to partial correlations, as shown by 0], 
using projections for Gaussian distributions. 

Lemma 2 (Fisher, 1924) 

Assume (Al) (without requiring faithfulness) . If the cumulative distribution 
function of p n „i,j * s denoted by F(-\n, p n „i,j), then the cdf of the sample partial 
correlation /5 n; jj|k with |k| = m < n — 1 is F[-\n — m, Pn-ij^] ■ That is, the 
effective sample size is reduced by m. 



A proof can be found in [5|; see also □ 
Lemma n and |21 yield then the following. 

Corollary 1 Assume (the first part of) (Al) and (the upper bound in) (A4). 
Then, for any 7 > ; 

SUp P[|/9 n;i)J -| k - P„.;i,j\k\ > 

( 4- 7 2 
< Ci(n - 2 - m n ) exp (n - 4 - m n ) log(— — ^ 

V 4 + Y ' 

for some constant < C% < 00 depending on M from (A4) only. 
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The PC-algorithm is testing partial correlations after the z-transform 
g(p) = 0.51og((l + p)/(l - p)). Denote by Z n .^\ V = g(p n -ij\k) and by 

^n;i,j\k 9{Pn;i,j\k) • 

Lemma 3 Assume the conditions from Corollary^ Then, for any 7 > 0, 



sup P[|Z n;iJ | k - z n;i j| k | > 7] 

< 0(n - m n ) ( exp((n - 4 - m n ) log( ^ tttfw )) + exp( -C 2 ( /? - /// ,, ) ) 

2 , 



, 4-( 7 /£) 2 
•4+( 7 /L) 5 



/or some constant < C2 < 00 and L = 1/(1 — (1 + M) 2 /4). 

Proof: A Taylor expansion of the z-transform g{p) = 0.5 log((l + p) /(l — p)) 
yields: 

Zn;i,j\k ~ z n;i,j\k = 5 , '(Pn;ij|k)(Pn;ij|k _ Pn;ij|k)> (10) 

where |p n;i j| k - p n -,ij\k\ < \Pn;i,j\k ~ Pn;i,j\k\- Moreover, g'(p) = 1/(1 - p 2 ). 
By applying Corollary ^ with 7 = k = (1 — M)/2 we have 



sup P[|Pn;t,i|k - Pn;i,j|kl < K ] 

> 1 — Ci(n— 2 — m n )exp(— C 2 (n — m n )). (11) 

Since 

5 lPn;ij'|kJ 



1 - ^n;ij|k 1 - (PniiJlk + (Pn;i,j\k ~ P n] i,j\k)f 



- 1 — (Af + k) 2 lf ^ n ' i > j \ k Pn 'M\k\ - K > 

where we also invoke (the second part of) assumption (A4) for the last 
inequality. Therefore, since k = (1 — M)/2 yielding 1/(1 — (M + k) 2 ) = L, 
and using ((lip, we get 

SUP P[W (Pn;i,j\k)\ < L] 

> l-C 1 (n-2-m n )exp(-C 2 (n-m n )). (12) 
Since \g'(p)\ > 1 for all p, we obtain with (|l(J|k 
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sup P[\Z i j\ k -z ni jikl > 7] (13) 



< SUp P[b'(Pn;i,i|k)| > L] + SUp P[|Pn;i,i|k-Pn;t,i|k| > l/L}- 



Formula ()13|) follows from elementary probability calculations: for two 
random variables U,V with \U\ > 1 (\U\ corresponding to |<7'(p)| and |V| to 
the difference \p — p\), 

T > [\UV\>i\ = V[\UV\ >-y,\U\ > L] +F[\UV\ > 7,1 < \U\ < L] 

< nm >L]+n\v\ >i/l\. 

The statement then follows from ()13|) . (|12() and Corollary ^ □ 
6.2.2 Analysis of the PC(m)-algorithm 

The population version PC pop (m n )-algorithm when stopped at level m n = 
m reac h jn constructs the true skeleton according to Proposition ^ Moreover, 
the PC pop (m)-algorithm remains to be correct when using m > m reac h n - 
An error occurs in the sample PC-algorithm if there is a pair of nodes i,j 
and a conditioning set k G K™ 1 - 1 (although the algorithm is typically only 
going through a random subset of K™ p ) where an error event -E^k occurs; 

denotes that "an error occurred when testing partial correlation for 
zero at nodes i,j with conditioning set k". Thus, 

P[an error occurs in the PC(m n )-algorithm] 
< P[ [J E iM ] < 0(p^ +2 ) sup^ F[£^|k], (14) 



using that the cardinality of the set \{i,j, k G K™ n }\ = 0(p™" +2 ), see also 
formula Now 



E i,j\k ~ E ij\k U Elj\k, (15) 

where 



type I error E-^^ : ^ n - \k\ - 3|Zjj| k | > $ X (l - a/2) and 2^| k = 0, 
type II error £^|k : \/ n — 1^1 — 3|Zjj|k| < 3> _1 (1 — a/2) and £jj|k / 0. 
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Choose a = a n = 2(1 — <3?(n 1//2 c ra /2)), where c n is from (A4). Then, 

sup P[£/.-, k ] = sup P[|Z Mlk - z iJlk \ > (n/(n - |k| - 2,)) 1 ' 2 c n /2] 
ijteKff i j.k- k™j> 

< 0(n - m n ) exp(-C 3 (n - m n )c 2 ), (16) 

for some < C3 < oo using Lemma 01 and the fact that log(|^-) ~ —5 2 /2 
as S — > 0. Furthermore, with the choice of a = a n above, 

sup P[#/J, k ] = sup P[|%| k | < y/n/(n- |k| - 3)c/2] 
i,j,kex™" ,,,.k. /V"*," 

< sup P[|Zij|k - 2i,j|k| > c n (l - \Jnjin - |k| - 3)/2)], 

because infjj. kg ^™n |^j|k| > c n since > |p| for all p and using as- 

sumption (A4). By invoking Lemma 01 we then obtain: 

sup P[£/k] < 0(n - m n ) exp(-C 4 (n - m n )c 2 n ) (17) 

for some < C 4 < 00. Now, by ({13 )1 -1(17 )1 we get 

P[an error occurs in the PC(m n )-algorithm] 

< 0(vn n+2 (n - m m ) exp(-C 5 (n - m n )c 2 n )) 

< o(n a(m " +2)+1 exp(-C 5 (n - m n )n~ M )) 

= o(exp(a(m n + 2)log(n)+log(n)-C5(n^ M -m n n^ M ))) = o(l), 

because n 1 ~ 2d dominates all other terms in the argument of the exp- function 
due to the assumption in (A4) that d < 6/2. This completes the proof. □ 

6.3 Proof of Theorem [U 

Consider the population algorithm PC pop (m): the reached stopping level 
satisfies m reac h G {q n — L<?n} 5 see Proposition ^ The sample PC(m n )- 
algorithm with stopping level in the range of m reac i l < m n = 0(n 1 ~ b ), 
coincides with the population version on a set A having probability P[A] = 
1 — 0(exp(— Cn 1 ~ 2d )), see the last formula in the proof of Theorem^ Hence, 
on the set A, ih rea ch,n = breach S {q n — 1, Qn}- The claim then follows from 
Theorem ^ d 
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